A Comparison of Whitespace Normalization Methods in a Text Art Extraction Method with Run Length Encoding
Author
Abstract
Text-based pictures, called text art or ASCII art, enrich expression in Web pages, email text and so on, but they can act as noise in text processing and text display. Text art extraction methods detect text art areas in given text data, so that the text art can be ignored or replaced with other strings. In our previous work, we proposed a text art extraction method that uses Run Length Encoding. That work, however, did not consider how to deal with whitespace in text art. In this paper, we propose three whitespace normalization methods for our text art extraction method and compare them experimentally. According to the results, the best of the three is the method that replaces each wide-width whitespace with two narrow-width whitespaces. It improves the average F-measure of precision and recall by about 4%.
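As a rough illustration of the best-performing normalization named in the abstract, the sketch below replaces each wide-width whitespace with two narrow-width whitespaces and then run-length encodes the line. It rests on assumptions the abstract does not confirm: that "wide width whitespace" means the ideographic space U+3000, that "narrow width whitespace" means the ASCII space, and that the encoding is a plain per-line character run-length encoding. The function names are hypothetical, and the paper's actual feature computation and the other two normalization methods are not shown.

```python
from itertools import groupby

# Assumption: the "wide width whitespace" is the ideographic space U+3000.
WIDE_SPACE = "\u3000"


def normalize_whitespace(line: str) -> str:
    """Replace each wide-width whitespace with two narrow (ASCII) spaces."""
    return line.replace(WIDE_SPACE, "  ")


def run_length_encode(line: str) -> list[tuple[str, int]]:
    """Encode a line as (character, run length) pairs."""
    return [(ch, sum(1 for _ in group)) for ch, group in groupby(line)]


if __name__ == "__main__":
    # One wide space followed by one narrow space, then text art strokes.
    raw = "\u3000 /\\"
    print(run_length_encode(raw))
    print(run_length_encode(normalize_whitespace(raw)))
```

Without normalization, the wide and narrow spaces form two separate runs; after normalization they merge into a single run of three narrow spaces, which is the kind of difference a run-length-based extractor would see.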
Related resources
A Decision Tree-based Text Art Extraction Method without any Language-Dependent Text Attribute
Text-based pictures, called text art or ASCII art, are often used in Web pages, email text and so on. They enrich expression in text data, but they can be noise for text processing and text display. For example, they can be an obstacle for text-to-speech software and natural language processing, and some of them lose their shape on small display devices. With text art extraction methods, which de...
Comparison of Count Normalization Methods for Statistical Parametric Mapping Analysis Using a Digital Brain Phantom Obtained from Fluorodeoxyglucose-positron Emission Tomography
Objective(s): Alternative normalization methods were proposed to solve the biased information of SPM in the study of neurodegenerative disease. The objective of this study was to determine the most suitable count normalization method for SPM analysis of a neurodegenerative disease based on the results of different count normalization methods applied on a prepared digital phantom similar to one ...
N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus
In this paper we compare n-grams and morphological normalization, two inherently different text-preprocessing methods, used for text classification on a Croatian-English parallel corpus. Our approach to comparing different text preprocessing techniques is based on measuring computational performance (execution time and memory consumption), as well as classification performance. We show that alt...
CLEFeHealth 2014 Normalization of Information Extraction Challenge using Multi-model Method
This work focuses on making clinical documents easier to understand for patients and clinical workers. Normalization values of ten attributes have been predicted by the multi-model method which alternatively uses rule based methods and machine learning methods to solve different attribute problems. Information of text structure, lexical, and grammatical features are used to achieve overall aver...
Text Extraction from Document Images- A Review
Text extraction from an image is a challenging task in computer vision. Text extraction plays an important role in providing useful and valuable information. This paper discusses various approaches such as Adaptive Local Connectivity Map (ALCM), Expectation Maximization (EM), Maximization Likelihood (ML), Markov Random Field (MRF), Spiral Run Length Smearing Algorithm (SRLSA), Curvelet transf...
Publication date: 2011